Sensitivity Analysis of Nonresponse Bias in the Current Population Survey
Abstract
Bias because of nonresponse in surveys is difficult to assess since information about nonresponders is rarely available. Models of nonresponse are often used to estimate the bias, but they all have assumptions about the relationship between the modeled variables and the nonresponse process. Sensitivity analysis can help explore the potential impact of the different assumptions. This study looks at nonresponse in the Current Population Survey (CPS) and compares several methods for modeling nonresponse using match data from the Decennial Census as a criterion. Sensitivity analysis is used to provide insights into the models. Using these approaches, the effect of nonresponse bias on employment estimates is considered to be very small.

Introduction

Studying nonresponse to household surveys is difficult because of a lack of information about nonrespondents. For panel surveys information can be borrowed from other panels. Survey households may also be matched with other sources, usually administrative data (registers) or censuses. For a single administration of a survey, information can be modeled based on characteristics of those interviewed early and late in the interview process. The lateness of response (for example, the last 5 percent) can be used, since if the effort to collect the data had ended earlier, they would have been nonrespondents (Bates and Creighton, 2000; Chiu, Riddick, and Hardy, 2001). Aggregate data from other surveys could also be used to model nonresponse.

In this study nonresponse bias is examined for employment status as measured by the long form of the 2000 census. The Current Population Survey (CPS) was matched to the long form, and the nonresponders from the CPS were used to estimate bias. Sensitivity analysis is used to give further insight into the potential bias.

Data Sources

A key source of data in this study resulted from matching person-level Census long-form data to Current Population Survey (CPS) cases.
Therefore, information obtained from the Census could be used to describe nonresponse cases in the CPS. Data from the CPS were selected for February through May 2000 to cover the response time frame for the 2000 Census long form. There were 212,914 enumerated persons with interviews or refusals in this time period; noncontact was not analyzed in this paper. Details about the CPS can be found in Technical Paper 63. The CPS is the primary source of information on the labor force characteristics of the U.S. population. Similar estimates can be generated from the Census. However, many methodological differences may contribute to differences between the CPS and Census; these are discussed in Dixon (2004).

Methods

The matching process failed to match about 10 percent of the CPS household members using the Census long form. The match was less successful for those who refused the CPS interview (no match for 25 percent of refusers). Statistical analyses in Dixon (2004) indicated there was only a small effect of matching on nonresponse bias estimation, so matching is not studied further in this paper.

The variables used to model nonresponse were adapted from Groves and Couper (1998) and Dixon (2001). A model with 17 predictors and 72 interactions was examined and reduced to a model with 8 predictors and 5 interactions. The adjusted pseudo R-squared fell from .23 to .20. While the goodness-of-fit statistics indicated there were other terms that should be added to the model, this model represented a trade-off between complexity and fit.

Weighted data were used, based on the sampling and match probabilities. No substantial differences were found relative to an unweighted analysis. Similarly, no adjustment was made for sample design, since the random selection of census long form respondents was expected to remove any design effect in the CPS. The variances are for the chosen sample, not for national estimates.
A logistic model was used to contrast the employment status (employed or not employed) of those who responded to the survey with that of those who refused the survey, based on information from other panels or the Census.

Sensitivity analysis

Sensitivity analysis covers a very broad range of methods that attempt to model the impact of unknowns on estimates (Greenland, 1996). The unknowns are characterized by assumptions in the estimation process. In surveys these could include assumptions about the sampling frame, the sampling method, nonresponse, measurement error, and the distribution of the measures, to name some of the more common concerns. The method has been used to study nonresponse and missing data problems (Scharfstein and Irizarry, 2003).

Simulation is a “back of the envelope” approach for cases where there are too many variables to fit on an envelope. The method has long been used in statistics to study the robustness of statistical methods, and similar techniques are used in sensitivity analysis.

Footnote 1: Census Day was April 1, 2000.

Mixture models are more specific in studying the frame and sampling characteristics. How large a difference between an RDD sample and a cell phone sample is needed to impact estimates? How could the differences in the distribution of nonrespondents impact estimates? Since we don’t usually have direct estimates of these unknowns, mixture modeling can be used to model a range of potential impacts given different assumptions about the data.

Propensity models are one of the most popular ways to study nonresponse, and they will be used as the principal method for this study. The bias of an estimate is thought to be affected by the relationship between the propensity to respond and the estimate of interest. In this study I will use labor force characteristics as my example of an estimate.

Bayesian methods are another popular approach to studying the impact of unknowns.
This includes Bayesian model averaging, where different models are combined to give a more robust estimate, and the impact of the differences between the models is viewed as a sensitivity analysis.

Small area estimation is also used. By studying the impact of nonresponse across different small areas, the sensitivity of the estimates for the parameters of the small area can be investigated.

The current study used the previously developed models together with several sensitivity methods to study nonresponse bias in the CPS.

Many potential sources of data can be used in the different models. Frame data are most often used, where the characteristics of sampled units can be used to predict nonresponse, but most models tend to be poor predictors. Panel data, where responses from other times in the survey series are used for bias estimation, are often used, but those who never respond are still an unknown. Match data, where information from administrative sources (or a census) is available, are especially useful in giving information about those who never respond. Contact history, often using the last 5 percent of respondents, can substitute for nonresponse (particularly noncontact). Those who initially refuse can substitute for those who consistently refuse. Follow-up samples of nonresponders can also be used for estimating the effect.

Each of the sources of information we might choose has assumptions which aren’t directly testable. This makes a sensitivity analysis potentially useful. Frame data are used in weighting adjustments for nonresponse. The models which attempt to estimate nonresponse from frame data tend to have very low predictability (R-squared values of a few percent are typical).

Bias measures

The relative bias provides a measure of the magnitude of the bias. Interpreted much like a percentage, it is useful for comparing bias across survey measures that are on different scales.

Relative Bias: $B_r(y_r) = B(y_r) / y_r$

where $B_r(y_r)$ is the relative bias with respect to the estimate, $y_r$, and $B(y_r)$ is the bias of that estimate.
The bias ratio provides an indication of how confidence intervals are affected by bias.

Bias Ratio = $B(y_r) / \text{Standard Error}$

Results

Simple estimates of means or proportions with their standard errors can be useful in a sensitivity analysis. Table 1 shows the unemployment estimates (from the census measures) for those who responded to the CPS (3.86), for those who refused (4.99), and the overall measure (3.88). Refusers have a higher unemployment rate than those who cooperate, but because there are so few refusers the effect on the overall estimate would only be -.02 of one percent. The difference between the responders and the overall measure gives the bias (-.02), which has a relative bias of 0.5% and a bias ratio of 0.16%.

Using the estimates of bias and the standard errors in a simulation can give a rough feel for the potential impact of bias at different response rates. Figure 1 shows the increase in bias as the nonresponse rate increases (with confidence intervals). The bottom left-hand corner is where the survey currently is, with about 5% refusal and very small relative bias. As the refusal rate increases, the bias grows linearly, and isn’t likely to be a problem (arbitrarily set at the 10% relative bias level) until around the 10% refusal rate. If the true bias were higher (by 1 standard error), the problem would appear at the 7% refusal rate.

It is important to estimate the relationship between unemployment and refusal to see where the problem might be. This simulation assumes that the nonresponse differences are constant across the different response rates. If the differences were not constant, then the curve could look very different. If the difference between respondents and nonrespondents was only large for those most likely to refuse, then the response rate would have little effect on the estimates, and the curve in Figure 1 would level off quickly.
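The bias measures and the constant-difference simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the overall rate is modeled as a simple two-group mixture of respondents and refusers, and the code is not the author's simulation, so its output should not be expected to reproduce Figure 1. The standard error needed for the bias ratio is not given in the text, so it is left as a parameter.

```python
# Sketch only: function names and the two-group mixture are illustrative
# assumptions, not the paper's own code.

# Unemployment rates (percent) quoted in the text for Table 1.
RESPONDERS = 3.86
REFUSERS = 4.99
OVERALL = 3.88


def bias(respondent_est, overall_est):
    """Signed bias of the respondent-only estimate."""
    return respondent_est - overall_est


def relative_bias(respondent_est, overall_est):
    """Bias as a fraction of the respondent estimate."""
    return bias(respondent_est, overall_est) / respondent_est


def bias_ratio(respondent_est, overall_est, standard_error):
    """Bias expressed in standard-error units."""
    return bias(respondent_est, overall_est) / standard_error


def simulated_relative_bias(refusal_rate, respondent_est, refuser_est):
    """Relative bias at a hypothetical refusal rate.

    Assumes the respondent and refuser rates stay fixed as the refusal
    rate changes, so the overall rate is a two-group mixture.
    """
    overall = (1 - refusal_rate) * respondent_est + refusal_rate * refuser_est
    return relative_bias(respondent_est, overall)


if __name__ == "__main__":
    print(f"bias: {bias(RESPONDERS, OVERALL):.2f}")
    print(f"relative bias: {100 * relative_bias(RESPONDERS, OVERALL):.2f}%")
    for rate in (0.05, 0.10, 0.20, 0.30):
        rb = simulated_relative_bias(rate, RESPONDERS, REFUSERS)
        print(f"refusal rate {rate:.0%}: relative bias {100 * rb:.2f}%")
```

With the Table 1 figures, the first two calls give a bias of -0.02 and a relative bias of about 0.5% in magnitude, in line with the values reported in the text. Under the mixture assumption the simulated relative bias is exactly proportional to the refusal rate, which is the linear growth the text describes.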
Census and CPS Panel data for refusers

A logistic model using only refusers who matched the Census was used to compare those who had CPS panel data with those who only had Census data (Table 2). An indicator for Census/CPS was used as the dependent variable. Seventeen variables that had been found to be related to refusal were used as independent variables, each in a separate model. An additional model with all the variables as simultaneous predictors was used to assess their unique relationships. Hispanic members were more likely to be in the Census only (odds ratio: 4.2771), but only when adjusted for the other variables, as were homeowners (1.0288). Refusers from multiple unit structures (MUL) and larger households (NUM) were more likely to be in the Census only, and never respond to the survey. “Relatives present” were less likely to be in the Census only, as were Male, Black and White refusers.

Table 3 shows the model for the propensity to refuse based on panel data. The resulting propensity score and its 95% confidence intervals were classified as “refusers” or “responders” based on the proportions which refuse in the CPS sample. The resulting classifications were used to estimate the bias in the unemployment estimates. The use of the 95% confidence intervals provides some adjustment for what is known about refusal.

Figure 2 compares estimates of unemployment using different sources. The left bar is the Census estimate of unemployment with confidence intervals, the middle bar is for respondents, and the right bar is for estimates from respondents based on refusal propensity. This comparison doesn’t adjust for how much we don’t know about refusers. These models give essentially the same estimates, as do other sources of data for the propensity model.

Figure 3 shows the predictive model for unemployment with the confidence bounds based on the predictive model for nonresponse.
Because we predict nonresponse so much more poorly than we predict unemployment, the estimates of bias are much more sensitive to what we don’t know about the nonrespondents’ refusal propensity. Adding uninformative variables to the models produced some suppressor effects, giving even wider intervals. Even adding census tract level mail return rates to the refusal models didn’t improve the intervals.

Discussion

The current work examined the sensitivity of refusal bias estimates using person-level match data. The estimates were more sensitive to the propensity model for refusal than to the standard errors of estimating unemployment. While the bias was very small based on the census match data, and the estimate based on a propensity model was very close to that value, the adjustment for nonresponse propensity was larger. The standard error was only 0.041% of the estimate for employment, but the difference between the confidence lines was 4.5% of the estimate. These are both small, but the ability to estimate nonresponse propensity more precisely would be very helpful in building confidence in the estimates. Adding information about nonresponse to the 2000 census mail form didn’t improve the nonresponse propensity model.

Limitations and future research

Additional work needs to be invested in studying noncontact. The relationship between personal characteristics and household and interview characteristics could be modeled with multilevel models (Dixon and Tucker, 2000; Fraboni, Rosina, Orsini, and Baldazzi, 2002). Additional methods of estimating bias (e.g., benchmarking) would be useful to evaluate.

Improving the nonresponse propensity estimation might be accomplished by adding questions about the respondents’ perception of the survey to assess their reluctance to respond. If the reasons for nonresponse operate on a continuum, then this might better measure the propensity.
If nonresponders are of a different type from responders, then estimation will prove very difficult.

Since the pattern of the relationship between the nonresponse propensity and the estimates is crucial to the bias estimation, other methods of exploring that relationship may be helpful. Quantile regression could provide some insight into the shape of that relationship. If the curve is flat, then higher propensities wouldn’t be any more biasing than lower propensities. If the curve rises very sharply at higher propensities, then the biases would be difficult to estimate, and sensitivity analysis would be likely to produce very wide bounds.
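To make the role of the curve's shape concrete, the following Python sketch contrasts a flat relationship with one that rises sharply at high refusal propensity. Everything in it is hypothetical: the two populations, the rates of 0.04 and 0.30, and the simplifying assumption that people refuse strictly in order of their refusal propensity. It is an illustration of the argument, not an analysis of the CPS data and not a quantile regression.

```python
# Sketch only: the populations below are hypothetical, chosen to contrast
# a flat curve with one that rises sharply at high refusal propensity.


def respondent_bias(outcome_by_rank, refusal_rate):
    """Bias of the respondent mean when the highest-propensity share refuses.

    outcome_by_rank holds each person's expected outcome (for example, the
    probability of being unemployed), ordered from lowest to highest
    refusal propensity. Refusal is assumed to follow that order exactly.
    """
    n = len(outcome_by_rank)
    n_refusers = round(refusal_rate * n)
    respondents = outcome_by_rank[: n - n_refusers]
    full_mean = sum(outcome_by_rank) / n
    respondent_mean = sum(respondents) / len(respondents)
    return respondent_mean - full_mean


# Hypothetical populations of 1,000 people each.
FLAT = [0.04] * 1000                # same rate at every propensity
SHARP = [0.04] * 950 + [0.30] * 50  # rate jumps in the top 5 percent

if __name__ == "__main__":
    for rate in (0.02, 0.05, 0.10, 0.20):
        flat_bias = respondent_bias(FLAT, rate)
        sharp_bias = respondent_bias(SHARP, rate)
        print(f"refusal {rate:.0%}: flat {flat_bias:+.4f}, sharp {sharp_bias:+.4f}")
```

In the flat population the bias stays at zero whatever the refusal rate. In the sharp population the bias grows until the whole high-rate group has refused and then stops changing, so the estimated bias depends almost entirely on how well that small group is identified.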